Goal

Use several classifiers (e.g. L1-penalized and ridge logistic regression, and xgboost) to build a hotspot detection model.

Data and model

Geographical levels: state and county.

Data: Between 2020-05-01 and 2020-08-25, take the

  • 7-day averaged JHU case proportion (incidence rate),
  • one FB survey signal (the smoothed household survey).

Then, form a covariate matrix by time-lagging these by 0 through 28 days. Additionally, add features measuring the "slope of the JHU IR over the past x days", for \(x \in \{3, 6, 9, 12, 15, \ldots\}\).
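As a concrete illustration, here is a minimal sketch of how such a lagged-plus-slope covariate block could be built for a single geo's daily series (the function and column names are made up for illustration; the actual pipeline code may differ):

```r
## Hypothetical sketch -- not the pipeline's actual feature code.
make_lagged_features <- function(x, lags = 0:28, slope_windows = seq(3, 15, by = 3)) {
  n <- length(x)
  lag_one <- function(v, k) c(rep(NA, k), v[seq_len(n - k)])
  ## Lagged copies of the series: column "lagK" holds x shifted back K days
  lagmat <- sapply(lags, function(k) lag_one(x, k))
  colnames(lagmat) <- paste0("lag", lags)
  ## Least-squares slope of the past w days, one column per window w
  slope_of <- function(t, w) {
    if (t < w) return(NA)
    y <- x[(t - w + 1):t]
    unname(coef(lm(y ~ seq_len(w)))[2])
  }
  slopemat <- sapply(slope_windows, function(w) sapply(seq_len(n), slope_of, w = w))
  colnames(slopemat) <- paste0("slope", slope_windows)
  cbind(lagmat, slopemat)
}
```

With lags 0:28 and five slope windows this yields 34 columns per input signal; stacking several signals and all geos gives a matrix of the shape reported below.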

The resulting covariate matrix is roughly \(3162 \times 193\) at the state level, and \(52093 \times 193\) at the county level. We have 953 different counties in our sample.

Response data

  • 1 if the JHU IR in the one-week period 22-28 days from now shows at least a 25% increase over the past one week (-1 through -7 days), and the IR in that later one-week period exceeds \(20\);
  • 0 otherwise.
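In code, the label for one location and day could be computed along these lines (a sketch of the definition above; `hotspot_label` and its argument names are illustrative, not the pipeline's actual function):

```r
## Illustrative sketch of the response definition.
hotspot_label <- function(ir, t, ahead = 22:28, lookback = 1:7,
                          pct_increase = 0.25, level = 20) {
  past   <- mean(ir[t - lookback])   # mean IR over the past week
  future <- mean(ir[t + ahead])      # mean IR 22-28 days ahead
  as.integer(future >= past * (1 + pct_increase) && future > level)
}
```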

Training and test data

  • We split the data into a training set and a test set at a 70%/30% ratio, splitting by geographic unit.

  • The training and test data were split by stratified sampling to produce a similar ratio of 0's and 1's in each.

  • The training set is used for training the hotspot model (e.g. for cross-validation of glmnet() or training xgb()), and,

  • The test data is used only for creating the ROC or adjusted ROC curves.

Cross-validation folds for the lasso and ridge models are also formed by stratified sampling, to make sure the ratio of 0's and 1's is similar across folds.
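One simple way to realize such a geo-level stratified fold assignment is to sort the geographic units by their proportion of 1's and deal them round-robin into folds. A sketch under that assumption (the actual `make_stratified_foldid_geo()` may work differently):

```r
## Sketch: assign whole geos to folds so label proportions stay similar.
make_geo_folds <- function(df, nfold = 5, seed = 0) {
  set.seed(seed)
  p1 <- tapply(df$resp, df$geo_value, mean)           # fraction of 1's per geo
  geos <- names(sort(p1 + runif(length(p1)) * 1e-9))  # sort, random tie-break
  fold_of_geo <- setNames(rep_len(seq_len(nfold), length(geos)), geos)
  unname(fold_of_geo[as.character(df$geo_value)])
}
```

Because whole geos are dealt into folds, every row of a given state or county lands in the same fold, which is what the visualization below relies on.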

Models: L1- and L2-penalized logistic regression, SVM, and xgboost.

Visualizing state-level data

We first visualize the 1's and 0's in the state-level data (highlighting each CV fold and the test data with different colored boxes):

## Load data and parameters
obj_state = get_data(geo_type = "state")
list2env(obj_state, envir = .GlobalEnv)
## <environment: R_GlobalEnv>
## Split training/test.
splitted <- stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100)

## Also split CV folds.
nfold = 5
foldid <- make_stratified_foldid_geo(splitted$df_train, nfold = nfold, seed = 0)
state_splits = lapply(1:nfold, function(ifold)
  splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unlist() %>% unique())

## Setup the plot.
par(mfrow = c(13, 4)); par(mar = c(3, 3, 1, 1)); par(cex=.7)

## Collect the geo values, in a particular order
train_geos = lapply(1:nfold, function(ifold){
  splitted$df_train[which(foldid==ifold),] %>% select(geo_value) %>% unique()}) %>%
  unlist()
test_geos = splitted$df_test %>% select(geo_value) %>% unlist() %>% unique()
geos = c(train_geos, test_geos)


## Make the individual plots
for(geo in geos){

  ## Form the response data (0's and 1's) for all geos
  dat0 = df_model %>% subset(geo_value==geo) %>%
    select(time_value,
           resp,
           incidence = "feature_lag0_confirmed_7dav_incidence_prop_indicator-combination")

  ## Combine it with original data for this geo (all time points)
  dat = mat %>% as_tibble() %>% filter(geo_value==geo, signal =="confirmed_7dav_incidence_prop") %>%
    select(time_value, incidence = value) 
  dat_combined = dat %>% full_join(dat0, by = c("time_value", "incidence"))
  dat_combined = dat_combined %>% mutate(resp=replace(resp, is.na(resp), 0))

  ## Make the plot
  dat_combined %>% with(plot(time_value, incidence, col = resp+1, type='o', pch=16, ylim= c(0,100)))
  abline(h = 20, col='grey50', lwd=3)
  legend("topleft", legend=toupper(geo), bty='n', cex=3)

  ## Also add legend for fold id (really bad code)
  splitnum = sapply(state_splits, function(mysplit) geo %in% mysplit) %>% which()
  if(length(splitnum) == 0){
    splitnum = "TEST"
  } else {
    box(lty=1, col=splitnum + 1, lwd=3)
  }
  legend("topright", legend = paste0("\nfold ", splitnum),
         bty="n", cex=2) 
}

We can see that the 0's and 1's are spread fairly evenly across geographies and time, though the class labels are somewhat imbalanced -- about 14% of the data are 1's and 86% are 0's.

all_tbl = df_model %>% select(resp) %>% table() ##%>% knitr::kable() %>% print()
train_tbl = splitted$df_train %>% select(resp) %>% table() ##%>% knitr::kable() %>% print() 
test_tbl = splitted$df_test %>% select(resp) %>% table() ##%>% knitr::kable() %>% print() 
all_folds_tbl = lapply(1:nfold, function(ifold){
  splitted$df_train[which(foldid==ifold),] %>% select(resp) %>% table() ##%>% knitr::kable() %>% print() 
})
cv_folds_tbl = do.call(cbind, all_folds_tbl)
colnames(cv_folds_tbl) = paste0("fold", 1:nfold)

tab = cbind(all = all_tbl,
            train = train_tbl,
            test = test_tbl,
            cv_folds_tbl)
tab = rbind(tab, "Proportion of 1's" = apply(tab, 2, function(a) a[2]/(a[1]+a[2])))
tab %>% knitr::kable(digits=2) %>% print()
                       all    train     test    fold1    fold2    fold3    fold4    fold5
0                  2998.00  2119.00   879.00   336.00   455.00   476.00   423.00   429.00
1                   472.00   331.00   141.00    74.00    89.00    68.00    53.00    47.00
Proportion of 1's     0.14     0.14     0.14     0.18     0.16     0.12     0.11     0.10

Results

First, download and form data:

obj_state = get_data(geo_type = "state")
obj_county = get_data(geo_type = "county")

Save the results:

save(obj_state, file=file.path(outputdir, "state-hotspot-data.RData"))
save(obj_county, file=file.path(outputdir, "county-hotspot-data.RData"))

Load them:

load(file=file.path(outputdir, "state-hotspot-data.RData"))
load(file=file.path(outputdir, "county-hotspot-data.RData"))

Now, we fit the hotspot model and visualize the model performances.

The performances of three different classifiers can be visualized as receiver operating characteristic (ROC) curves. Here, thick transparent lines are the results using Facebook data, and thin solid lines are those excluding Facebook data (only using JHU data).
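For reference, an ROC curve is just the set of (false positive rate, true positive rate) pairs swept over classification thresholds. A from-scratch sketch (the report's actual curves come from `make_plots()`):

```r
## Compute ROC points from predicted probabilities and 0/1 labels.
roc_points <- function(prob, label) {
  thresholds <- sort(unique(c(-Inf, prob, Inf)), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    pred <- as.integer(prob >= th)
    c(fpr = sum(pred == 1 & label == 0) / sum(label == 0),  # false positive rate
      tpr = sum(pred == 1 & label == 1) / sum(label == 1))  # true positive rate
  }))
}
```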

Two plots (the first for state-level data, the second for county-level data) are repeated five times, each with a different random seed for the training/test split and the CV split.

for(iseed in 0:4){
  print(paste0("Random seed: ", iseed))
  for(geo_type in c("state", "county")){
    if(geo_type == "state") list2env(obj_state, envir = globalenv())
    if(geo_type == "county") list2env(obj_county, envir = globalenv())
    splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100+iseed)
    response = "confirmed_7dav_incidence_prop"
    plots = make_plots(destin = outputdir, splitted, lags, n_ahead, geo_type,
                       response, fn_response_name, threshold, slope, onset,
                       geo_cv_split_seed = 0+iseed)
    print(plots$roc)
  }
}
## [1] "Random seed: 0"

## [1] "Random seed: 1"

## [1] "Random seed: 2"

## [1] "Random seed: 3"

## [1] "Random seed: 4"

Next up: as a summary statistic for the advantage of using Facebook data, we'll calculate the difference in the area under the curve (AUC) between the two versions, for each classifier, as a function of how far ahead hotspots are defined, i.e. for a range of n_ahead values in \(\{10, \ldots, 30\}\).
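A trapezoidal-rule AUC, applied to the with-FB and without-FB ROC curve of each classifier, would give that summary. A sketch (function name is illustrative):

```r
## Trapezoidal area under an ROC curve given its (fpr, tpr) points.
auc_trapezoid <- function(fpr, tpr) {
  o <- order(fpr, tpr)
  sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)
}
```

The quantity of interest would then be the per-classifier difference, e.g. `auc_trapezoid(fpr_fb, tpr_fb) - auc_trapezoid(fpr_nofb, tpr_nofb)`, computed for each value of n_ahead.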

Extra: diagnostics

We perform some simple diagnostics at the state level.

The predicted probabilities are concentrated near 0 and 1.

list2env(obj_state, envir = globalenv())
## <environment: R_GlobalEnv>
splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100)
preds = make_preds(destin = outputdir, splitted, lags, n_ahead, geo_type,
                   response, fn_response_name, threshold, slope, onset,
                   geo_cv_split_seed = 0,
                   include_fb = TRUE)
par(mfrow=c(1,3))
hist(preds %>% select(contains("lasso"))%>% unlist(), col='grey80', xlab = "", main=paste0("lasso predicted probabilities"))
hist(preds %>% select(contains("ridge"))%>% unlist(), col='grey80', xlab = "", main=paste0("ridge predicted probabilities"))
hist(preds %>% select(contains("xgb"))%>% unlist(), col='grey80',   xlab = "", main=paste0("xgb predicted probabilities"))

The predicted probabilities by test label:

par(mfrow=c(1,3))
testmat = cbind(preds=preds, resp= splitted$df_test$resp) %>% as_tibble()
for(model in c("lasso", "ridge", "xgb")){
  res = list("0" = testmat %>% subset(resp==0) %>% select(contains(model)) %>% unlist(),
             "1" = testmat %>% subset(resp==1) %>% select(contains(model)) %>% unlist())
  boxplot(res, xlab = "Test Labels\n(Hotspots are 1's)",
          ylab = "Predicted prob.",
          main = paste0("Test set performance (", model, ")"))
}

Let's see the predicted probabilities (thick colored lines) in the test sets overlaid with the JHU case proportions (red points are hotspots).

We'll repeat this five times for analyses with different random seeds for training/test data split and CV fold splits, to try to cover more states.

models = c("lasso", "ridge", "xgb")
predcols = RColorBrewer::brewer.pal(3, "Set2")
names(predcols) = models
for(iseed in 0:4){
  splitted = stratified_sample_split_geo(df_model, pct_test = 0.3, seed = 100+iseed)
  preds = make_preds(destin = outputdir, splitted, lags, n_ahead, geo_type,
                     response, fn_response_name, threshold, slope, onset,
                     geo_cv_split_seed = iseed,
                     include_fb = TRUE)
  res = full_join(splitted$df_test %>% select(geo_value, time_value, resp, val=contains("lag0_confirmed")),
                  preds) %>% as_tibble()
  splitted$df_test %>% select(time_value) %>% summary()
  preds %>% select(time_value) %>% summary()
  par(mfrow=c(8, 2))
  par(mar=c(3,4,1,1))
  par(oma=c(0,0,10,0))
  one_geo_diagnose <- function(df,...){
    plot(y=df$val, x=df$time_value, type='o',
         col=df$resp+1, pch=16, ylim=c(0,70),
         xlab = "Time",
         ylab = "JHU \n Case Proportions")
    abline(h=seq(from=0,to=200, by=5), col='grey80', lty=2)
    thisgeo = df$geo_value %>% unlist() %>% unique()%>%toupper()
    legend("topleft", legend = thisgeo, bty="n", cex=2)
    for(model in models){
      lines(x=df$time_value, y=(df%>%select(contains(model))%>%unlist())*10, col=predcols[model], lwd=2)
    }
    return()
  }
  res %>%
    group_by(geo_value) %>%
    group_map(one_geo_diagnose, keep=TRUE) -> dummy_obj
  mtext(paste0("Random seed=", iseed),
        side = 3, outer=TRUE)
  legend("topright", lwd=2, col=predcols,
         legend=paste0("Predicted prob (x10) of ", models), bg="white")
}